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Abstract. We describe a methodology for designing efficient parallel and distributed scientific 
software. This methodology utilizes sequences of mechanizable algebra— based optimizing transfor- 
mations. In this study, we apply our methodology to the FFT, starting from a high-level algebraic 
algorithm description. Abstract multiprocessor plans are developed and refined to specify which 
computations are to be done by each processor. Templates are then created that specify the loca- 
tions of computations and data on the processors, as well as data flow among processors. Templates 
are developed in both the MPI and OpenMP programming styles. 

Preliminary experiments comparing code constructed using our methodology with code from 
several standard scientific libraries show that our code is often competitive and sometimes performs 
better. Interestingly, our code handled a larger range of problem sizes on one target architecture. 
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1. Introduction. — 

We present a systematic, algebraically based, design methodology for efficient 
implementation of computer programs optimized over multiple levels of the proces- 
sor/memory and network hierarchy. Using a common formalism to describe the prob- 
lem and the partitioning of data over processors and memory levels allows one to 
mathematically prove the efficiency and correctness of a given algorithm as measured 
in terms of a set of metrics (such as processor/network speeds, etc.). The approach 
allows the average programmer to achieve high-level optimizations similar to those 
used by compiler writers (e.g. the notion of tiling). 

The approach is similar in spirit to other efforts using libraries of algorithm build- 
ing blocks based on C++ template classes. In POOMA for example, expression tem- 
plates using the Portable Expression Template Engine (PETE) 
('http://www.acl.lanl.gov.pete) were used to achieve efficient distribution of array in- 
dexing over scalar operations [T| [2l [3l l4l ISj [6l Fft [8l [9l fTO] . 

As another example. The Matrix Template Library (MTL) [TTJ [T^] is a system 
that handles dense and sparse matrices, and uses template meta-programs to generate 
tiled algorithms for dense matrices. 

For example: 



{A + B),j=Aj+B,j, (1.1) 

can be generalized to the situation in which the multi-dimensional arrays A and B 
are selected using a vector of indices v. 

In POOMA A and B were represented as classes and expression templates were 
used as re-write rules to efficiently carry out the translation to scalar operations 
implied by: 
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{A + B)s = Aij + Bj! 



(1.2) 



The approach presented in this paper makes use of A Mathematics of Arrays 
(MoA) and an indexing calculus (i.e. the "0 calculus) to enable the programmer to 
develop algorithms using high-level compiler-like optimizations through the ability to 
algebraically compose and reduce sequences of array operations. 

As such, the translation from the left hand side of Eg. 11.21 to the right side is just 
one of a wide variety of operations that can be carried out using this algebra. In the 
MoA formalism, the array expression in Eq. 11.21 would be written: 



where we have introduced the psi-operator tp to denote the operation of extracting an 
element from the multi-dimensional array using the index vector (v) . 

In this paper we give a demonstration of the approach as applied to the creation of 
efficient implementations of the Fast Fourier Transform (FFT) optimized over multi- 
processor, multi-memory/network, environments. Multi-dimensional data arrays are 
reshaped through the process of dimension lifting to explicitly add dimensions to 
enable indexing blocks of data over the various levels of the hierarchy. A sequence 
of array operations (represented by the various operators of the algebra acting on 
arrays) is algebraically composed to achieve the Denotational Normal Form (DNF). 
Then the V'-calculus is used to reduce the DNF to the ONF {Operational Normal 
Form) which explicitly expresses the algorithm in terms of loops and operations on 
indices. The ONF can thus be directly translated into efficient code in the language 
of the programmer's choice be it for hardware or software application. 

The application we use as a demonstration vehicle - the Fast Fourier Transform - 
is of significant interest in its own right. The Fast Fourier Transform (FFT) is one of 
the most important computational algorithms and its use is pervasive in science and 
engineering. The work in this paper is a companion to two previous papers [13j in 
which the FFT was optimized in terms of in-cache operations leading to factors of two 
to /owr speedup in comparison with our previous records. Further background material 
including comparisons with library routines can be found in Refs. [Ml [151 [HI [H] 
and [H]. 

Our algorithm can be seen to be a generalization of similar work aimed at out- 
of-corc optimizations [19] . Similarly, block decompositions of matrices (in general) 
are special cases of our rehape-traspose design. Most importantly, our designs are 
general for any partition size, i.e. not necessary blocked in squares, and any number 
of dimensions. Furthermore, our designs use linear transformations from an alge- 
braic specification and thus they are verified. Thus, by specifying designs (such as 
Gormen's and others ^19J) using these techniques, these designs too could be verified. 

The purpose of this paper IS NOT to attempt any serious analysis of the number 
of cache misses incurred by the algorithm in the spirit of of Hong and Kung and 
others [20l [211 [Hj. Rather, we present an algebraic method that achieves (or is 
competitive) with such optimizations meciianically. Through linear transformations 
we produce a normal form, the ONF, that is directly implementable in any hardware 
or software language and is realized in any of the processor/memory levels 23 . Most 
importantly, our designs are completely general in that through dimension lifting 
we can produce any number of levels in the processor/memory hierarchy. 



^■^(A + B) ^ vipA + vipB 



(1.3) 
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One objection to our approach is that one might incur an unacceptable perfor- 
mance cost due to the periodic rearrangement of the data that is often needed. This 
will not, however, be the case if we pre-fetch data before it is needed. The necessity 
to pre-fetch data also exists in other similar cache-optimized schemes. Our algorithm 
does what the compiler community calls tiling. Since we have analyzed the loop struc- 
tures, access patterns, and speeds of the processor/memory levels, pre- fetching be- 
comes a deterministic cost function that can easily be combined with reshape-transpose 
or tiling operations. 

Again we make no attempt to optimize the algorithm for any particular architec- 
ture. We provide a general algorithm in the form of an Operational Normal Form that 
allows the user to specify the blocking size at run time. This ONF therefore enables 
the individual user to choose the blocking size that gives the best performance for 
any individual machine, assuming this intentional information can be processed by a 
com pilefl 

It is also important to note the importance of running reproducible and determin- 
istic experiments. Such experiments are only possible when dedicated resources exist 
AND no interrupts or randomness effects memory /cache/communications behavior. 
This means that multiprocessing and time sharing must be turned off for both OS's 
and Networks. 

Conformal Computing is the name given by the authors to this algebraic ap- 
proach to the construction of computer programs for array-based computations in 
science and engineering. The reader should not be misled by the name. Conformal 
in this sense is not related to Conformal Mapping or similar constructs from math- 
ematics although it was inspired by these concepts in which certain properties are 
preserved under transformations. In particular, by Conformal Computing we mean 
a mathematical system that conforms as closely as possible to the underlying struc- 
ture of the hardware. Further details of the theory including discussion of MoA and 
V'-calculus are provided in the appendix. 

In this feasibility study, we proceed in a semi-mechanical fashion. At each step of 
algorithm development, the current version of the algorithm is represented in a code- 
like form, and a simple basic transformation is applied to this code-like form. Each 
transformation is easily seen to be mechanizable. We used the algebraic formalism 
described in Section 11.21 to verify that each transformation produces a semantically 
equivalent algorithm. At each step, we use judgment in deciding which transforma- 
tion to carry out. This judgment is based on an understanding of the goals of the 
transformations. 

The following is a more detailed description of our methodology, as carried-out 
in this feasibility study: 

1.1. Overview of Methodology. Phase 1. Obtain a high-level description 
of the problem. In this study, a MATLAB-like description is used. 

Phase 2. Apply a sequence of transformations to optimize the sequential ver- 
sion of the algorithm. The transformations used are of an algebraic nature, can be 
interpreted in terms of operations on whole arrays, and should not negatively effect 
subsequent parallelization. 

Phase 3. Apply a sequence of transformations to develop a parallel computation 
plan. Such plans consists of sequential code that indicates which parts of the overall 

^Processing intentional information will be the topic of a future paper 
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work are to be done by each individual processor. For each iteration of an outer loop, 
the plan specifies which iterations of inner loops should be performed by each of the 
processors in a multiprocessor system. 

Phase 4. Given a parallel computation plan, apply a sequence of transformations 
to produce a parallel computation template. Such templates specify a parallel or 
distributed version of the algorithm by indicating (1) which parts of the overall work 
are to be done by each individual processor, (2) where data is located, and (3) data 
movement among processors. Various parallel programming language styles can be 
used to express such templates. In this study, we use both a message passing per- 
processor style, motivated by MPI [ 24J , and an all-processor style, motivated by 
OpenMP [H]. 

There is also an implicit fifth phase, in which a parallel computation template is 
transformed into code expressed in a given programming language. 

In the future, we envision that scientific programs will be developed in interactive 
development environments. Such environments will combine human judgment with 
compiler-like analysis, transformation, and optimization techniques. A programmer 
will use knowledge of problem semantics to guide the selection of the transformation 
at each step, but the application of the transformation and verification of its safety 
will be done mechanically. A given sequence of transformations will stop at a point 
where a version of the code has been obtained such that subsequent optimizations 
can be left to an available optimizing compiler. This feasibility study is a first step 
towards the development of such an interactive program development environment. 

1.2. Program Development Via Use of Array and Indexing Algebra. 

Although not discussed further in this paper, Mullin's Psi Calculus [261 123 Ell ES] 
plays an underlying role in this feasibility study. This calculus provides a unified 
mathematical framework for representing and studying arrays, array operations, and 
array indexing. It is especially useful for developing algorithms centered around trans- 
formations involving array addressing, decomposition, reshaping, distribution, etc. 
Each of the algorithm transformations carried out here can be expressed in terms of 
the Psi Calculus, and we used the Psi Calculus to verify the correctness of program 
transformations in the development of our FFT algorithms. 

1.3. Related Results. A general algebraic framework for Fourier and related 
transforms, including their discrete versions, is discussed in [30]. As discussed in 
[31[ 132] and using this framework, many algorithms for the FFT can be viewed in 
terms of computing tensor product decompositions of the matrix B^, discussed in 
Section 12.11 below. Subsequently, a number of additional algorithms for the FFT 
and related problems have been developed centered around the use of tensor product 
decompositions ^331 131133 I3S ISZl I3H] ■ The work done under the acronym FFTW 
is based on a compiler that generates efficient sequential FFT code that is adapted 
to a target architecture and specified problem size [39l HQI IHl l42l l43]. A variety 
of techniques have been used to construct efficient parallel algorithms for the FFT 

mHSlllHlllIlllHlllS]. 

Previously, we manually developed several distributed algorithms for the FFT 
[50j . experimenting with variants of bit-reversal, weight computations, butterfly com- 
putations, and message generation mechanisms. In |50) , we report on the evaluation 
of twelve resulting variant programs for the one-dimensional FFT, evaluated by run- 
ning a consistent set of experiments. In |141 I15j . we compare the performance of 
our programs with that of several other FFT implementations (FFTW, NAG, ESSL, 
IMSL). As in [5T], we begin by developing sequential code for the radix 2 FFT, starting 
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from a high-level specification, via a sequence of optimizing algebra-based transfor- 
mations. This sequential code provides a common starting point for the development 
of sequential code for the general radix FFT in [5T] , and for the development of par- 
allel/distributed algorithms presented here. For the convenience of the reader and to 
provide context for the transformations used here, we repeat this common develop- 
ment here in Section [3l Similarly, for the convenience of the reader, we also recall 
relevant experimental results from |51| in Section [7l These experiments provide ev- 
idence that the design methodology outlined here can already produce competitive 
code. 

2. High— Level Description of Algorithm. 

2.1. Basic Algorithm. Our approach to algorithm development begins by ex- 
pressing the algorithm in a suitable high-level mechanism. This specification may 
take the form of a high-level statement of the algorithm, possibly annotated with 
additional specifications, such as the specification of the array of weights for the FFT. 
Our design and development of efhcient implementations of an algorithm for the FFT 
began with Van Loan's [52] suitably commented high-level MATLAB program for the 
radix 2 FFT, shown in Figure [2TT1 



Input: X in C" and n = 2*, where t > 1 is an integer. 
Output: The FFT of x. 

x^E„x (1) 

for q 1 to t (2) 

begin (3) 

L^2i (4) 

r ^ n/L (5) 

XLxr ^ Bl XLxr (6) 

end (7) 
Here, P„ is a n x n permutation matrix. Letting — L/2, and ujl 

II, ^l. 



IL, 



be the L'th root of unity, matrix Bl = 
is a diagonal matrix with values 1,ujl, ■ 

Fig. 2.1. High-level algorithm for the radix 2 FFT 



where Ql, 



, Lu^' ^ along the diagonal. 



In Line 1, P„ is a permutation matrix that performs the bit-reversal permutation 
on the n elements of vector x. The reference to XLxr in Line 6 can be regarded 
as reshaping the n element array x to be a L x r matrix consisting of r columns, 
where each column can be viewed as a vector of L elements. This reshaping of x is 
column-wise, so that each time Line 6 is executed, each pair of adjacent columns of 
the preceding matrix are concatenated to produce a single column of the new matrix. 
Line 6 can be viewed as treating each column of this matrix as a pair of vectors, each 
with L/2 elements, and doing a butterfly computation that combines the two L/2 
element vectors in each column to produce a vector with L elements. The butterfly 
computation, corresponding to multiplication of the data matrix x by Bl, combines 
each pair of L/2 element column vectors from the old matrix into a new L element 
vector for each column of the new matrix. 

2.2. Components of the Basic Algorithm. By inspection of Figure 12. 11 one 
can identify the following five components of the high-level radix 2 FFT algorithm: 
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1. The computation of the bit-reversal permutation (Line 1). 

2. The computation of the complex weights occurring in the matrices £7^, . These 
weights are discussed further in Section [3.1l 

3. The butterfly computation that, using the weights, combines two vectors from 
X, producing a vector with twice as many elements (Line 6). The butterfly 
computation is scalarized, and subsequently refined in Sections 13. 2H3. 71 
Contiguous memory access gives better performance at all levels of a memory 
hierarchy. Consequently, making data access during the butterfly compu- 
tation contiguous is a driving factor throughout this paper in refining the 
butterfly computation. 

4. The strategy for allocating the elements of the reshaped array x (line 6) 
to the processors for use in parallel and distributed implementations of the 
computation loop (the q loop of Lines 2 through 7). Alternative strategies 
for which processor will compute which values of x during each iteration of 
the computation loop are discussed in Section |4l The actual location of the 
data is discussed in Sections [5] and [6l 

5. The generation and bundling of messages involving values from x (Lines 2 
through 7). Message passing of data in distributed computation is discussed 
in Section O 

3. Development of Sequential Code. 

3.1. Specification of the Matrix of Weight Constants. The description of 
the algorithm in Figure 12.11 uses English to describe the constant matrix Bj^ . This 
matrix is a large two-dimensional sparse array. Of more importance for our purposes, 
this array can be naturally specified by using constructors from the array calculus with 
a few dense one dimensional vectors as the basis. This yields a succinct description of 
how Bl can be constructed via a composition of array operations. Indeed, B^ is never 
actually materialized as a dense L x L matrix. Rather, the succinct representation is 
used to guide the generation of code for the FFT, and the generated code only has 
multiplications corresponding to multiplication by a nonzero element oi B^. 

On the top-level, B^ is constructed by the appropriate concatenation of four 
submatrices. Only the construction of two of these submatrices, namely /^^ and 
^L,, need be separately specified. Matrix 1^^ occurs twice in the decomposition of 
B^. Matrix —^^l, can be obtained from matrix by applying element -wise unary 
minus. 

Each of the two submatrices to be constructed is a diagonal matrix. For each of 
these diagonal matrices, the values along the diagonal are successive powers of a given 
scalar. Psi Calculus contains a constructor that produces a vector whose elements are 
consecutive multiples or powers of a given scalar value. There is another constructor 
diagonaUz^l that converts a vector into a diagonal matrix with the values from the 
vector occurring along the diagonal, in the same order. We specified the matrices II, 
and CIl^ by using the vector constructor that produces successive powers, followed by 
the diagonalize constructor. 

Since L = 2*?, matrix B^ is different for each iteration of the loop in Figure [2Tl 
Accordingly, the specification of 1^^ and is parameterized by L (and hence im- 
plicitly by q). 



The diagonalize operation is itself specified as a composition of more primitive array opera- 
tions. 
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3.2. Scalarization of the Matrix Multiplication. A direct and easily au- 
tomated scalarization of the matrix multiplication in Line 6 of Figure 12.11 produced 
code similaiQ to that given in Figure 13.11 Here weight is a vector of consecutive 
powers of Note that in Figure [3TT1 the multiplication by the appropriate constant 
from the diagonal of matrix is done explicitly, using a value from vector weight, 
and the multiplication by the constant 1 from the diagonal of matrix 1^^ is done 
implicitly. Because the scalarized code does an assignment to one element at a time, 
the code stores the result of the array multiplication into a temporary array xx, and 
then copies xx into x. The sparseness of matrix is reflected in the assignments to 
XX (row, col). The right-hand side of these assignment statements is the sum of only 
the two terms corresponding to the two nonzero elements of the row of Bj^ involved in 
the computation of the left-hand side element. Moreover, the "regular" structure of 
Si, expressed in terms of diagonal submatrices, provides a uniform way of selecting 
the two terms. 

do col = 0,r-l 
do row = 0,L-1 

if (row < L/2 ) then 

xx(row,col) = x(row,col) + weight(row)*x(row+L/2,col) 
else 

xx(row,col) = x(row-L/2 , col) - weight(row-L/2)*x(row,col) 
end if 
end do 
end do 

Fig. 3.1. Direct Scalarization of the Matrix Multiplication 

3.3. Representing Data as a 1 Dimensional Vector. The data array x in 
Figure BTT] is a 2-dimensional array, that is reshaped during each iteration. Computa- 
tionally, however, it is more efficient to store the data values in a; as a one dimensional 
vector, and to completely avoid performing this reshaping. Psi calculus easily han- 
dles avoiding array reshaping, and automates such transformations as the mapping of 
indices of an element of the 2-dimensional array into the index of the corresponding 
element of the vector. The L x r matrix XLxr is stored as an n-element vector, which 
we denote as program variable x. The elements of the two-dimensional matrix are en- 
visioned as being stored in column-major order, reflecting the column-wise reshaping 
occurring in Figure [2T] Thus, element XLxrirow, col) of XLxr corresponds to element 
x(L*col+row) of X. Consequently, when L changes, and matrix XLxr is reshaped, no 
movement of data elements of vector x actually occurs. Replacing each access to 
an element of the two dimensional matrix XLxr with an access to the corresponding 
element of the vector x, produces the code shown in Figure [3?2l 

As an alternative to the scalarized code of Figure 13. 2[ the outer loop variable 
can iterate over the starting index of each column in vector x, using the appropriate 
stride to increment the loop variable. Instead of using a loop variable col, which 
ranges from to r-1 with a stride of 1, we use a variable, say col', such that col' 
= L * col, and which consequently has a stride of L. By doing this, we eliminate 
the multiplication L * col that occurs each time an element of the arrays xx or x is 
accessed. This form of scalarization produces the code shown in Figure 13.31 

^ The data array was not reshaped for each value of q, so references to the data array elements 
was more complicated than shown in Figure [STI 
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do col = 0,r-l 
do row = 0,L-1 

if (row < L/2 ) then 

xx(L*col+row) = x(L*col+row) + weight (row) *x(L*col+row+L/2) 
else 

xx(L*col+row) = x(L*col+row-L/2) - weight(row-L/2)*x(L*col+row) 
end if 
end do 
end do 

Fig. 3.2. Scalarized Code with Data Stored in a Vector 

do col' = 0,n-l,L 
do row = 0,L-1 

if (row < L/2 ) then 

xx(col'+row) = x(col'+row) + weight (row) *x(col'+row+L/2) 
else 

xx(col'+row) = x(col'+row-L/2) - weight (row-L/2) *x(col'+row) 
end if 
end do 
end do 

Fig. 3.3. Striding through the Data Vector 



3.4. Elimination of Conditional Statement. The use of the conditional 
statement that tests row < L/2 in the above code can be eliminated automatically, 
as follows. We first re-tile the loop structure so that the innermost loop iterates over 
each pair of data elements that participate in each butterfly combination. To accom- 
plish this re-tiling, we first envision reshaping the two-dimensional array XLxr into 
a three-dimensional array xi^/2x2xr- Under this reshaping, element x^xrifow, col) 
corresponds to element XLf2x2xr(j'owm.odL/2,lrow/{L/2)\,col). The middle di- 
mension of Xj^i2xixr splits each column of x^xr into the upper and lower parts of 
the column. Scalarizing Line 6 of Figure 12.11 based on the three-dimensional array 
XL/2x2xri E^nd indexing over the third dimension in the outer loop, over the first di- 
mension in the middle loop, and over the second dimension in the innermost loop, 
produces the code shown in Figure [Ql 

To eliminate the conditional statement, we unroll the innermost loop, and produce 
the code shown in Figure [5751 

When we represent the data array as a one-dimensional vector, the code shown 
in Figure [371] is produced. 

3.5. Optimizing the Basic Block. The basic block in the inner loop of the 
above code has two occurrences of the common subexpression weight (row) *x(col'+row+L/2) . 
We hand-optimized this basic block, to compute this common subexpression only 

once. This produces more efficient code for the basic block, as shown in Figure [5771 

Incorporating all the transformations described so far, the loop in Figure 12.11 is 
scalarized as shown in Figure 13.81 In this code, pi and i are Fortran "parameters" , 
i.e., named constants. 

3.6. Doing the Butterfly Computation In— Place. The use of the temporary 
array xx in the butterfly computation is unnecessary, and can be avoided by the 
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do col = 0,r-l 
do row = 0,L/2-l 
do group =0,1 

if ( group == ) then 
xx(row, group, col) = 

x(row, group, col) + weight(row)*x(row,group+l,col) 

else 

xx(row, group, col) = 

x(row,group-l,col) - weight (row) *x (row, group, col) 

end if 
end do 
end do 
end do 

Fig. 3.4. Re-tiled Loop 

do col = 0,r-l 
do row = 0,L/2-l 

xx(row,0,col) = x(row,0,col) + weight (row) *x(row, 1 , col) 
xx(row, 1 , col) = x(row,0,col) - weight (row) *x (row, 1 , col) 
end do 
end do 

Fig. 3.5. Unrolled Inner Loop 



use of a scalar variable to hold the value of x(col'+row). Code incorporating this 
modification is shown in Figure [2111 where scalar variable d is used for this purpose. 



3.7. Vectorizing the Butterfly Computation. An alternative coding style 
for the butterfly computation is to use monolithic vector operations applied to ap- 
propriate sections of the data array. This vectorization of the butterfly computation 
produces the code shown in Figure [3.ldp . Here, cvec is a one-dimensional array used 



to store the vector of values assigned to variable c during the iterations of the inner 
loop (the row loop) in Figure [321 Similarly, dvec is a one-dimensional array used 
to store the vector of values assigned to variable d during the iterations of this inner 
loop. 

4. Development of Plans for Parallel and Distributed Code. 

4.1. An Overview of Plans, Templates, and Architectural Issues. We 

want to generate code for two types of architecture. 

• Distributed memory architecture (where each processor has its own local 
address space). 

• Shared memory architecture with substantial amounts of local memory (where 
there is a common address space that includes the local memories). 

To accommodate these architectures, we will develop an appropriate parallel com- 
putation plan, followed by appropriate parallel computation templates. 

What we mean by a parallel computation plan is sequential code that indicates 
which parts of the overall work is to be done by each individual processor. A plan 



^We assume that code can select a set of array elements using a start, stop, and stride mechanism, 
with a default stride of 1. 
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do col' = 0,n-l,L 
do row = 0,L/2-l 

xx(col'+row) = x(col'+row) + weight (row) *x(col'+row+L/2) 
xx(col'+row+L/2) = x(col'+row) - weight (row) *x(col'+row+L/2) 
end do 
end do 

Fig. 3.6. Unrolled Inner Loop With Data Stored as a Vector 

c = weight (row) *x(col'+row+L/2) 
xx(col'+row) = x(col'+row) + c 
xx(col'+row+L/2) = x(col'+row) - c 

Fig. 3.7. Optimized Basic Block 

specifies for eacfi iteration of an outer loopH, which iterations of inner loops should be 
performed by each of the processors of a multiprocessor system. At each point in the 
parallel computation, we envision that responsibility for the n elements is partitioned 
among the processors. Our intention is that this responsibility is based on an owner 
computes rule, namely a given processor is responsible for those data elements for 
which it executes the butterfly assignment. 

What we mean by a template is distributed or parallel "generic code" that in- 
dicates not only which parts of the overall work is to be done by each individual 
processor, but also where data is located, and how data is moved. To convert a tem- 
plate into code, a specific programming language needs to be selected, and more detail 
needs to be filled in. We use two styles of templates. The first style is a per-processor 
style, motivated by MPI [24]. The other style is an all-processor style, motivated by 
OpenMP [25]. 

It is our intention that in the transformation from an FFT parallel computation 
plan into a template for a distributed architecture or a shared memory architecture 
where substantial local memory is available to each processor, each of the processors 
will hold in its local memory those data elements it is responsible for, and do the 
butterfly operations on these data elements. 

4.2. Splitting the Outer Loop of the Butterfly Computation. Suppose 
there are m processors, where m is a power of 2. We let psize denote n/m, the number 
of elements that each processor is responsible for. Our parallel computation plans 
will assume that psize > m. Now envision the data in the form of a two-dimensional 
matrix that gets reshaped for each value of g, as in Figure 12.11 During the initial 
rounds of the computation, for each processor, the value of psize is large enough to 
hold one or more columns of the two-dimensional data matrix, so we will make each 
processor responsible for multiple columns, where the total number of elements in 
these columns equals psize. For each iteration of g, the length of a column doubles, 
so that at some point, a column of the matrix will have more than psize elements. 
This first occurs when q equals logj (psize) -I- 1. We call this value breakpoint. Once 
q equals breakpoint, each column has more than psize elements. Consequently, we 
no longer want only one processor to be responsible for an entire column, and so we 
change plans. Before q equals breakpoint, the number of columns is a multiple of m. 



For the FFT, this outer loop is the q loop of Figure |3^ 
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do q = 1 , t 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do col' = 0,n-l,L 
do row = 0,L/2-l 

c = weight (row) *x(col'+row+L/2) 
xx(col'+row) = x(col'+row) + c 
xx(col'+row+L/2) = x(col'+row) - c 
end do 
end do 

X = XX 

end do 

Fig. 3.8. Loop with Optimized Basic Block 

do q = 1 , t 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do col' = 0,n-l,L 
do row = 0,L/2-l 

c = weight (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 

Fig. 3.9. Loop with In-Place Butterfly Computation 



From breakpoint on, there are fewer than m columns, but the number of rows is a 
multiple of m. 

As long as q is less than breakpoint, we can use a block approach to the compu- 
tation, with each processor computing the new values of several consecutive columns 
of the two-dimensional matrix. Assume that the processors are numbered through 
m — 1. In terms of the one-dimensional vector of data elements, the columns whose 
new values are to be computed by processor p are the psize consecutive vector ele- 
ments beginning with vector element p* psize. When q equals breakpoint, the number 
of elements in each column is 2 * psize, whereas we want each processor to compute 
the new value of only psize elements. Thus, when q equals breakpoint, we need to 
switch to a different plan. To facilitate the switch to a different plan, we first modify 
Figure [3?9l by splitting the q loop into two separate loops, as shown in Figure [4T] 

4.3. Parallel Computation Plan When Local Memory is Available. Con- 
sider the two q loops in Figure |4?T1 For the first q loop, we will use a block approach 
to splitting the computation among the processors, as discussed in Section !?^ Recall 
that before q equals breakpoint, the number of columns is a multiple of m, and from 
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do col' = 0,n-l,L 

cvec(0:L/2-l) = weight (0 : L/2-1) * x(col'+L/2 : col'+L-l) 

dvec(0:L/2-l) = x(col': col'+L/2-l) 

x(col':col'+L/2-l) = dvec (0 : L/2-1) + cvec (0 : L/2-1) 

x(col'+L/2: col'+L-l) = dvec (0 : L/2-1) - cvec (0 : L/2-1) 
end do 



Fig. 3.10. Vectorizing the Butterfly Computation 

do q = 1, breakpoint - 1 
L = 2**q 
do row = 0, L/2-1 

weight (row) = EXP ( (2*pi*i*row) /L) 
end do 

do col' = 0,n-l,L 
do row = 0, L/2-1 

c = weight (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 

do q = breakpoint, t 
L = 2**q 
do row = 0, L/2-1 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do col' = 0,n-l,L 
do row = 0, L/2-1 

c = weight (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 

Fig. 4.1. Loop Splitting the Butterfly Computation 



breakpoint on, the number of rows is a multiple of to. For the second q loop, which 
begins with q equal to breakpoint, we use a cyclic approach, with each processor p 
computing the new values of all the rows j of the two-dimensional matrix such that 
j is congruent to p mod to. This corresponds to all the iterations of the inner loop 
where program variable row is congruent to p mod to. 

We express a computation plan by modifying each of the q loops in Figure l4?l] so 
that for each value of q, there is a new outer loop using a new loop variable p, which 
ranges between and to — 1. Our intention is that all the computations within the 
loop for a given value of p will be performed by processor p. Figure 14.21 shows the 
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effect of restructuring each of the two q loops in Figure WA\ by introducing a p loop to 
reflect the parallel computation plan. The first q loop uses a block approach for each 
p, and the second q loop uses a cyclic approach for each p. 

do q = 1, breakpoint - 1 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do p = 0,m-l 

do col' = p*psize , (p+1) *psize-l ,L 
do row = 0,L/2-l 

c = weigfit (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 
end do 

do q = breakpoint, t 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do p = 0,in-l 

do col' = 0,n-l,L 
do row = p,L/2-l,m 

c = weight (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 
end do 

Fig. 4.2. Initial Parallel Computation Plan 

Each execution of a p loop in Figure consists of m iterations. Note that for 
each value of q, the sets of elements of the x vector accessed by these m iterations are 
pairwise disjoint. Our intention is that each of these m iterations will be performed 
by a different processor. Since these processors will be accessing disjoint sets of 
elements from the x vector, the execution of different iterations of ap loop by different 
processors can be done in parallel. 

Now consider the first q loop in Figure 14.21 For each value of p within this first q 
loop, the data elements accessed are the psize elements beginning at position p*psize 
of X. Since for a given p, these are the same set of data elements for every value 
of q in the outer loop, we can interchange the loop variables q and p in the first q 
loop. Now consider the second q loop in Figure 14.21 For each value of p within this 
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second q loop, the data elements accessed are the psize elements of x whose position 
is congruent to p mod m. Since for a given p, these are the same set of data elements 
for every value of q in the outer loop, we can interchange the loop variables q and p 
in the second q loop. This analysis shows that we can interchange the loop variables 
q and p in each of the two loops in Figure [42] to make p the outermost loop variable, 
resulting in Figure l473l Note that a consequence of this loop interchange is that the 
computation of the weight vector for each value of q is now done for each value of p. 

do p = 0,m-l 

do q = 1, breakpoint - 1 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do col' = p*psize, (p+l)*psize-l ,L 
do row = 0,L/2-l 

c = weight (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 
end do 

do p = 0,m-l 

do q = breakpoint, t 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do col' = 0,n-l,L 
do row = p,L/2-l,m 

c = weight (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 
end do 

Fig. 4.3. Local Memory Parallel Computation Plan with Processor Loops Outermost 



4.4. Computing Only Needed Weights. In the second q loop in Figure l4!3l 
each p computes an entire weight vector of L/2 elements. However, for a given p, 
only those elements in the weight vector whose position is congruent to p mod m are 
actually used. We can change the plan so that only these elements of the weight 
vector are computed. The computation of weights in the second q loop would then 
be as follows. 
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do row = p,L/2-l,m 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

4.5. Making Weight Vector Access Contiguous. Note that in the second 
q loop, the weight vector is accessed with consecutive values of the program variable 
row. These consecutive values of row are strided, with a stride of m. Consequently, 
the accesses of the weight vector in the second q loop are strided, with a stride of m. 
Within each iteration of the second q loop we can make the accesses to the weight 
vector contiguous, as follows. The key is to use a new weight vector weightcyclic, 
in which we store only the weight elements needed for a given value of p. Con- 
sequently, these needed elements of weightcyclic will be accessed contiguously. The 
relationship between the weightcyclic and weight yectois is that for each j, < j < 
weightcyclic{j) will hold the value of weight{m * j + p), i.e., for each value of row 
occurring in an inner loop with loop control 

do row = p,L/2-l,m 
the value of weight(row) will be held in weightcyclic{ ^°^^^ ). 

With this change, the computation of weights in the second q loop becomes the 
following 

do row = p,L/2-l,m 

weightcyclic ( (row-p) /m) = EXP((2*pi*i*row)/L) 
end do 

We can simplify the loop control and subscript computation for weightcyclic by 
changing the loop control variable. In the loop which computes the needed weight 
elements, we use a loop index variable row', where row' — {row —p)/m, i.e. row = 
m * row' + p. 

With this change, the weight computation within the second q loop becomes the 
following, where row' has a stride of 1. 
do row' = 0,L/(2*m)-l,l 

weightcyclic (row') = EXP( (2*pi*i* (m*row'+p) ) /L) 
end do 

Within the butterfly computation, where the loop variable is row, a use of 
weight{row) becomes a use of weightcyclic{{row—p)/m). 

Figure 14.41 shows the second q loop after incorporating the improvements from 
Section 14.41 and this Section to the computation and access of weights. 

4.6. Making Data Access Contiguous. Note that the accesses of data vector 
X in the second q loop are strided, with a stride of m. We can make access to the 
needed data in x in the second q loop contiguous, as follows. We introduce a new 
data vector xcyclic to hold only those data elements accessed for each p. For each 
p, the vector xcyclic contains L/m elements from each of the n/L columns, for a total 
of n/m = psize elements. The relationship between the xcyclic and x vectors is that 
for each col' , where < col' < n — 1 and coV is a multiple of L, and row, where 
< row < L and row is congruent to p mod m, xcyclic{col' /m + (row — p) /m) will 
hold the value of x{col' + row). Consequently, for each value of col' and row occurring 
in the butterfly loop, x{col' + row) will be held in xcyclic{col' / m + [row —p)/m) and 
x{col' + row + L/2) will be held in xcyclic(col' /m + [row + L/2 — p)/m). 

At the beginning of each iteration of the p loop, the required psize elements of 
x must be copied into xcyclic, and at the end of each iteration of the p loop, these 
psize elements must be copied from xcyclic into x. At the template level, this copying 
will be done by either message passing or assignment statements, depending on the 
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do p = 0,m-l 

do q = breakpoint, t 
L = 2**q 

do row' = 0,L/(2*m)-l,l 

weightcyclic(row') = EXP((2*pi*i*(m*row'+p))/L) 
end do 

do col' = 0,n-l,L 
do row = p,L/2-l,m 

c = weightcyclic ( (row-p) /m) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 
end do 

Fig. 4.4. Second q Loop with Contiguous Access of Needed Weights 



architecture and programming language. Here, where we are developing a compilation 
plan, we indicate this copying abstractly, using function calls. We envision array x 
as containing a "centralized" copy of all the data elements, and xcycUc as containing 
a subset. The respective subsets of x used by the iterations of the p loop represent 
a partition of the data elements in x. We envision the n elements of x as being 
partitioned into m subsets, where each iteration of the p loop copies one of these 
subsets into xcyclic. The copying from x to xcyclic of the psize elements of x whose 
position in x is congruent to p mod m is expressed using the function call 
CENTRALIZED_TO_CYCLIC_PARTITIONED(x, xcyclic, m, psize, p). 
The copying of the psize elements of xcyclic into x is expressed using the function call 
CYCLIC_PARTITIONED_TD_CENTRALIZED (xcyclic, x,m, psize, p). 

Figure 14.51 shows the effect of incorporating these changes to the second q loop. 
Note that since the stride in the row loop is m, these changes do indeed provide 
contiguous data access. 

We can simplify the loop control and subscript computation of vector elements 
occurring in Figure [4751 by changing the loop control variables, as shown in Figure l4!6l 

In Figure [4?6l we use a variable col" to range over all the columns, where col' = 
m * col" i.e. col" — col' /m. Since the bounds for loop control variable col' in 
Figure [4751 are 0,n-l,L, the bounds for the loop control variable col" in Figure [476] 
are 0,psize-l,L/m. 

Similarly, in Figure [4761 we use a variable row', where row = m * row' + p, i.e. 
row' — {row —p)/m. Consider the bounds of the row loop in the second q loop; 
the start, stop, stride are p,L/2-l,m. For the corresponding row' loop, the start 

becomes 0, and the stop becomes ^ ^ii. Since < » < m, we can conclude that 

< < 1, SO the stop can be simplified to 2^ — 1. Finally, the step is 1. So, the 
bounds for the row' loop become ,L/ (2*m) -1 , 1. 

Consequently, xcyclic (col"+row') will contain the value of x(ni*col"+m*row'+p) . 

4.7. Facilitating Exploitation of Local Memory. Note that in Figure [4761 
there is no data flow involving xcyclic between the iterations of the p loop. A 
consequence is that when the plan is subsequently transformed into a template for 
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do p = 0,m-l 

CENTRALIZED _TO_CYCLIC_PARTITIONED (x , xcyclic , m , psize , p) 
do q = breakpoint ,t 
L = 2**q 

do row' = 0,L/(2*m)-l , 1 

weightcyclic(row') = EXP((2*pi*i*(m*row'+p))/L) 
end do 

do col' = 0,n-l,L 
do row = p,L/2-l,m 

c = weightcyclic ( (row-p) /m) *xcyclic(col'/m+(row-p)/m+L/(2*m)) 
d = xcyclic (col'/m+ (row-p) /m) 
xcyclic(col'/m+(row-p)/m) = d + c 
xcyclic (col'/m+ (row-p) /m+L/(2*m)) = d - c 
end do 
end do 
end do 

CYCLIC_PARTITIONED_TD_CENTRALIZED (xcyclic , x , m , psize , p) 
end do 

Fig. 4.5. Plan for Second q Loop with Contiguous Data Access 
do p = 0,m-l 

CENTRALIZED _T0_CYCL1C_PART1T10NED (x , xcyclic , m , psize , p) 
do q = breakpoint, t 
L = 2**q 

do row' = p,L/2-l,m 

weightcyclic (row') = EXP((2*pi*i*(m*row'+p))/L) 
end do 

do col" = 0,psize-l,L/m 
do row' = 0,L/(2*m)-l,l 

c = weightcyclic (row') *xcyclic (col"+row'+L/ (2*m) ) 
d = xcyclic (col"+row') 
xcyclic (col"+row') = d + c 
xcyclic (col"+row'+L/(2*m)) = d - c 
end do 
end do 
end do 

CYCLIC_PARTITIONED_TO_CENTRALIZED (xcyclic , x , m , psize , p) 
end do 

Fig. 4.6. Plan for Second q Loop with Contiguous Data Access and Simplified Loop Control 



parallel and distributed code, xcyclic can be a local variable of each processor. 

A similar transformation to that in Section 14.61 can be done to the second q 
loop in the plan, although the payoff only occurs subsequently when templates are 
constructed from the plan. We can facilitate making access to the data in the first 
q loop be via a processor-local variable. This can be accomplished by introducing a 
new data vector xblock, to hold the data values accessed by each iteration of the p 
loop. 

For each p, we will use the vector xblock to hold the psize elements accessed 
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during that iteration of the p loop. The relationship between the xblock and x vectors 
is that for each col' , where < col' < n — 1 and col' is a multiple of L, and row, 
where < row < L, x{col' + row) will be held in xhlock{col' + row — p * psize). 
Figure l477l shows the effect of incorporating the use of xblock in the first q loop. 

do p = 0,m-l 

CENTRALIZED _TO_BLOCK_PARTITIONED (x , xblock , m , psize , p) 
do q = 1, breakpoint - 1 

L = 2**q 

do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 

end do 

do col' = p*psize, (p+l)*psize-l,L 
do row = 0,L/2-l 

c = weight (row) *xblock(col'+row+L/2-p*psize) 
d = xblock(col'+row-p*psize) 
xblock(col'+row-p*psize) = d + c 
xblock(col'+row+L/2-p*psize) = d - c 
end do 
end do 
end do 

BLOCK_PARTITIONED_TO_CENTRALIZED (xblock , x , m , psize , p) 
end do 

Fig. 4.7. Plan for First q Loop 

We can simplify the loop control and subscript computation of vector elements 
by changing the loop control variable col' , as shown in Figure [4?8l We use a variable 
col", where col" = col' — p * psize. So, the bounds for the loop control variable 
col" are 0,psize-l,L. Consequently, xblock(col"+row) will contain the value of 
x(col'+row-p*psize) . 

Figure 14.91 shows the final plan, combining Figures 14.61 and 14.81 

4.8. Flexibility in Setting Breakpoint. Note that the split of the q loop of 
Figure into two separate q loops, as shown in Figure HTTl is correct for any value 
of breakpoint. However, for the transformation into the plan shown in Figure 14.21 to 
be correct, breakpoint must satisfy certain requirements, as follows. 

Requirement 1: Consider the first q loop in Figure The correctness of this q 
loop requires that L < psize, i.e. 2'' < psize. Thus, each value of q for which the loop 
is executed must satisfy q < \og2(j>size). Since the highest value of q for which the loop 
is executed is breakpoint— 1, the first q loop requires that breakpoint < \og2{psize) + l. 

Requirement 2: Consider the second q loop in Figure [121 The correctness of this 
q loop requires that L/2 > m, i.e. 2"^^^ > m. Thus, each value of q for which the loop 
is executed must satisfy q > log2 (m) + 1 . Since the lowest value of q for which the loop 
is executed is breakpoint, the second q loop requires that breakpoint > log2(TO) + 1. 

Together, these two requirements imply that: 

log2(?TT.) + 1 < breakpoint < log2 (psize) + 1. 

Since we are assuming that m < psize, there is a range of possible allowable 
values for breakpoint. Thus far in Section 31 we have assumed that breakpoint is set 
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do p = 0,m-l 

CENTRALIZED_TO_BLOCK_PARTITIONED(x,xblock,m,psize,p) 
do q = 1, breakpoint - 1 

L = 2**q 

do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 

end do 

do col" = 0,psize-l,L 
do row = 0,L/2-l 

c = weight (row) *xblock(col"+row+L/2) 
d = xblock(col"+row) 
xblock(col"+row) = d + c 
xblock(col"+row+L/2) = d - c 
end do 
end do 
end do 

BLDCK_PARTITIONED_TD_CENTRALIZED(xblock,x,m,psize,p) 
end do 

Fig. 4.8. Plan for First q Loop with Simplified Loop Control 



at the high end of its allowable range, namely at \og2{psize) + 1. In fact, breakpoint 
could be set anywhere within its allowable range. If the plan shown in Figure 14.41 is 
used, there is an advantage in setting breakpoint to the low end of its allowable range, 
for the following reason. Moving some values of q from the first q loop to the second q 
loop would reduce the number of weights that need to be computed by each processor. 

5. Development of Per Processor Templates for Distributed Memory 
Architectures . 

5.1. Distributed Template Using Local Copy of Data Vector. 

5.1.1. Partial Per Processor Distributed Code Template. Figure lSTTl shows 
how the parallel computation plan from Figure [4731 can be transformed into a partially 
specified template (subsequently abbreviated as a partial template) for distributed 
codeQ. The template in Figure [Ol is intended to represent per-processor code to be 
executed on each of the m processors. We assume that each processor has a variable, 
named myid, whose value is the processor number, where the processor number ranges 
from to TO — 1. Thus, each of the to processors, say processor i, has its value of 
local variable myid equal to i. In Figure 14. 3i each p loop has to iterations, with p 
ranging from to to — 1. In Figure [57T| variable myid is used to ensure that processor 
i executes iteration i of each p loop, i.e., the iteration where p equals i. 

A SYNCHRONIZE command is inserted between the two q loops to indicate that 
some form of synchronization between processors is needed because data computed 
by each processor in its first q loop is subsequently accessed by all the processors in 
their second q loop. We call Figure EH] a partial template because it does not address 
the issue of data location or data movement between processors. These issues must be 
resolved, and the SYNCHRONIZE command between the two q loops must be replaced 

Any of the improvements from Section [J] such as computing only the needed weights, as 
discussed in Section 14.41 could be incorporated into the plan prior to this transformation, but for 
simplicity, here we give a transformation from the plan in Figure 1431 
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do p = 0,m-l 

CENTRALIZED _TD_BLOCK_PARTITIONED (x , xblock , m , psize , p) 
do q = 1, breakpoint - 1 

L = 2**q 

do row = 0,L/2-l 

weight (row) = EXP( (2*pi*i*row)/L) 

end do 

do col" = 0,psize-l,L 
do row = 0,L/2-l 

c = weight (row) *xblock(col"+row+L/2) 
d = xblock(col"+row) 
xblock(col"+row) = d + c 
xblock(col"+row+L/2) = d - c 
end do 
end do 
end do 

BLOCK_PARTITIONED_TO_CENTRALIZED (xblock , x , m , psize , p) 
end do 

do p = 0,m-l 

CENTRALIZED_TO_CYCLIC_PARTITIONED (x , xcyclic ,m , psize , p) 
do q = breakpoint, t 
L = 2**q 

do row' = p,L/2-l,m 

weightcyclic(row') = EXP((2*pi*i*(m*row'+p))/L) 
end do 

do col" = 0,psize-l ,L/m 
do row' = 0,L/(2*m)-l,l 

c = weightcyclic (row') *xcyclic (col"+row'+L/ (2*m) ) 
d = xcyclic (col"+row') 
xcyclic (col"+row') = d + c 
xcyclic (col"+row'+L/(2*m)) = d - c 
end do 
end do 
end do 

CYCLIC_PARTITIONED_TO_CENTRALIZED (xcyclic , x , p ,in , psize) 
end do 

Fig. 4.9. Combined Plan Using Partitioned Data 



by statements using an appropriate implementation language mechanism that permits 
each processor to proceed past the synchronization point only after all the processors 
have reached the synchronization point. These statements must ensure that the data 
needed by each processor is indeed available to that processor. 

5.1.2. Responsibility Based Distribution of Data Vector on Local Mem- 
ories. In order to transform the partial template of Figure [5TT] into a (fully specified) 
template, we need to address the issue of where (i.e., on which processor) data is 
located, and how data moves between processors. A straightforward approach is to 
let each processor have space for an entire data vector x of n elements, but in each of 
the two q loops, only access those psize elements of x that the processor is responsible 
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do q = 1 , breakpoint - 1 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do col' = myid*psize, (myid+l)*psize-l,L 
do row = 0,L/2-l 

c = weight (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 
SYNCHRONIZE 
do q = breakpoint, t 
L = 2**q 
do row = 0,L/2-l 

weight (row) = EXP((2*pi*i*row)/L) 
end do 

do col' = 0,n-l,L 

do row = myid,L/2-l ,m 

c = weight (row) *x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 

Fig. 5.1. Partial Per-Processor Parallel Code Template Obtained from Plan of Figure\4-3\ 



for. To emphasize that space for all n elements of a vector x is allocated on each 
processor, we refer to this space on a given processor p as Xp. 

For a distributed architecture or shared memory architecture with substantial 
local memories, the "current" value of each data element will reside on the processor 
that computes its new value, i.e., is responsible for it. We now consider this assignment 
of each element from the data array to the processor that is responsible for it. To start, 
consider the first q loop in Figure ISTTl Since q ranges between 1 and breakpoint— I, 
with breakpoint= log2(psize) + l and L = 2"^, we can conclude that L ranges between 2 
and psize. For each processor p, consider the data elements accessed in each iteration 
of this first q loop. These are data elements x{col' + row) and x{col' + row + L/2), 
where col' ranges between p * psize and (p + 1) * psize — 1 in steps of L, and row 
ranges between and L/2 — 1. Since L < psize for each iteration of the q loop, 
the accessed data elements are psize consecutive elements, beginning with element 
x{p*psize). These elements can be described as x(j)*psize : {{p+ 1) *psize) — 1:1). 
This is a block distribution of the responsibility for array x, corresponding to the block 
approach to the first p loop, as described in Section l473l 

Now consider the second q loop in Figure 15.11 In each iteration of this q loop, 

21 



L/2, > psize. Since psize > m, we can conclude that L/2 > m. For each processor 
p, consider the data elements accessed in each iteration of the q loop. These are data 
elements x{col' + row) and x[col' + row + L/2), where coV ranges between and 
n — 1, in steps of L, and row ranges between p and L/2 — 1, in steps of m. Since 
L/2 is a multiple of to, so is L, so that coV = mod to for every value of col' . Also, 
row= p mod m. Consequently, coV + row and col' + row + L/2 are each congruent to 
p mod to. Thus, every data element accessed by processor p is congruent to p mod 
TO. Now, note that there are n/L iterations of the col' loop. Because L/2 > to, for 
each p and col', there are L/(2m) iterations of the row loop. Each of these iterations 
does the butterfly operation on two data elements of array x. Note that 

n L n 

— * * 2 — — = psize. 

L 2m TO 

Thus for each iteration of the q loop, processor p does an assignment to psize data 
elements. We now note that these are distinct elements, since for each value of row 
and col', different data elements are accessed. In conclusion, the elements accessed 
by processor p during each iteration of the loop are the set of elements x{i) such that 
i = p mod TO. Using Fortran90-like start-stop-stride notation these elements can be 
described as x{p : n — 1 : to). This is a cyclic distribution of the responsibility for 
array x, corresponding to the cyclic approach to the second p loop, as described in 
Section [131 

5.1.3. Construction of Messages. We now consider the construction of mes- 
sages between processors for distributed computation. This entails both bundling of 
data values into messages, and having both the sender and recipient of each message 
specify which of its data elements are included in each message. 

The data can be initially stored in various ways, such as on disks, distributed 
among the processors, or all on one processor. Each of these cases can be readily 
handled. Here, we envision that the data is initially centralized on one processor. 

In describing the details of message passing, we assume that the numbering of 
processors from to to — 1 is such that initially all the data is on processor 0. Before 
the first q loop, as per the block approach, the initially centralized data must be 
distributed among the processors. The processor with all the data must send every 
other processor, otherp, the set of psize contiguous data elements, beginning with data 
element x{psize * otherp), and each such processor must receive these data elements. 
The recipient processor then stores the received psize data values in the appropriate 
position in its x vector. 

A high-level description of message-passing code to perform the block distribution 
is shown in Figure [^21 Since this code refers to the variable myid, the same code can 
be used on each processor. Processor will execute the otherp loop, sending a message 
to each of the other processors. Each processor other than processor will execute a 
RECEIVE command. The SEND command sends a message, and has three parameters: 
the first element to be sent, the number of elements to send, and the processor which 
should receive the message. The RECEIVE command receives a message, and has three 
parameters: where to place the first element received, the number of elements to 
receive, and the processor which sent the message to be received. In Figure [5?2l we 
refer to the local data array as Xmyid, to emphasize that it is the local space for the 
data array that is the source and destination of data sent and received by a given 
processor. 

We assume that the SEND command is nonblocking, and uses buffering. Thus, 
the contents of the message to be sent is copied into a buffer, at which point the 
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statement after the SEND command can be executed. We assume that the RECEIVE 
command is blocking. Thus, the data is received and stored in the locations specified 
by the command, at which point the statement after the RECEIVE command can be 
executed. 

! DISTRIBUTE_CENTRALIZED_TO_BLOCK(x™yid,ni,psize) 
if myid = then 

do otherp = l,m-l 

SEND (x„iyid (psize*otherp) , psize , otherp) 

end do 
else 

RECEIVE (x„jj^,;£;(psize*my id) , psize ,0) 
endif 

Fig. 5.2. Centralized to Block Distribution 

Now consider the block to cyclic redistribution that occurs after the first q loop, 
and before the second q loop. Recall that each processor has psize data elements. 
Of the psize elements on any given processor before the redistribution, psize /m must 
wind up in each of the processors (including itself) after the redistribution. No message 
need be sent for the data values that are to wind up on the same processor as before 
the redistribution. Thus, each processor must send psize /m elements to each of the 
other processors. A high-level description of message-passing code to perform the 
block to cyclic redistribution is shown in Figure 15.31 We assume that the SEND and 
RECEIVE commands are flexible enough so that the first parameter can specify a set of 
array elements using notation specifying start, stop, and stride. The SEND commands 
can be nonblocking because for any given processor, disjoint sets of data elements 
are involved in any pair of message commands on that processor. An alternative to 
Figure ESI is to first do all the SENDs, and then do all the RECEIVES. 

! DISTRIBUTE_BLOCK_TD_CYCLIC(x,„j,id,m, psize) 
do otherp = 0,m-l 

if otherp ^ myid then 

SEND (xrnyid (psize*myid+otherp : psize* (myid+1) -1 : m) ,psize/m , otherp) 
RECEIVE (xmj,i(;(psize*otherp+myid: psize* (otherp+1) -1 :m) ,psize/m, otherp) 
endif 
end do 

Fig. 5.3. Block to Cyclic Redistribution 

Another alternative to Figure 15.31 is to first collect all the data at the centralized 
site, i.e. processor 0, and then do a centralized to cyclic redistribution. This strategy 
is described in more detail in Figure 15. 4[ where the block to cyclic redistribution is 
done as block to centralized redistribution, followed by a centralized to cyclic redis- 
tribution. The approach in Figure [53] entails only 2(m — 1) messages, in contrast 
to the m{m — 1) messages entailed in Figure [531 However, the centralized site may 
become a bottleneck. Moreover, data elements are transmitted twice, to and from the 
centralized site, rather than only once. 

We envision that at the end of the FFT computation all the data elements are col- 
lected at the centralized processor, processor 0. A high-level description of message- 
passing code to perform this cyclic to centralized redistribution is shown in Figure [531 
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! DISTRIBUTE_BLDCK_TQ_CYCLIC(x,„yid,m,psize) 

DISTRIBUTE_BLOCK_TD_CENTRALIZED(xm,yzc;,m,psize) 

DISTRIBUTE_CENTRALIZED_TD_CYCLIC(xmyid,m,psize,ii) 

! DISTRIBUTE_BLOCK_TO_CENTRALIZED {yLmyid , m , psize) 
if myid = then 

do otherp = l,m-l 

RECEIVE (xmyid (psize*otherp) , psize , otherp) 

end do 
else 

SEND(xmyirf(psize*myid) , psize, 0) 
endif 

! DISTRIBUTE_CENTRALIZED_TO_CYCLIC Umyid , m , psize , n) 
if myid = then 

do otherp = l,m-l 

SEND (x„jyid (otherp :n-l :m) , psize , otherp) 

end do 
else 

RECEIVE (xmj,id (myid :n-l :m) , psize, 0) 
endif 

Fig. 5.4. Block to Cyclic Redistribution Using Centralized Site as Intermediary 

! DISTRIBUTE_CYCLIC_TO_CENTRALIZED (x„,yid , m , psize , n) 
if myid = then 

do otherp = l,m-l 

RECEIVE(xmyirf(otherp:n-l :m) , psize , otherp) 

end do 
else 

SEND (xmyid (myid :n-l :m) , psize, 0) 
endif 

Fig. 5.5. Cyclic to Centralized Redistribution 

5.1.4. Per— Processor Distributed Code Template. Figure shows how 
the partial parallel code template from Figure [5TT] is transformed into a template that 
specifies where data is located and how it moves between processors. In Figure 15.61 
we refer to the arrays involved in the template as Xmyid and weight^iyid, to emphasize 
that space local to a given processor is being accessed by the per-processor template. 
We assume that scalar variables, such as q and L are understood to be local variables 
within each processor. As discussed in Section 15.1.21 the distributed code template 
will perform a data redistribution from the initial centralized distribution to a block 
distribution, followed by the first q loop, followed by a block to cyclic redistribution, 
followed by the second q loop (which begins with q equal to breakpoint), followed by a 
cyclic to centralized redistribution. The code in Figure [5?6l is executed on each of the 
TO processors. The DISTRIBUTE_BLOCK_TO_CYCLIC command in Figure serves as 
the mechanism implementing the SYNCHRONIZE command in Figure ISTTl The block to 
cyclic redistribution can be performed either as described in Figure [5T51 or as described 
in Figure In either case, the blocking RECEIVES do the synchronization. 
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DISTRIBUTE_CENTRALIZED_TO_BLOCK(x™yid,m,psize) 
do q = 1, breakpoint - 1 

L = 2**q 

do row = 0,L/2-l 

weightnij^idCrow) = EXP( (2*pi*i*row)/L) 

end do 

do col' = myid*psize, (myid+l)*psize-l,L 
do row = 0,L/2-l 

c = weightmyid(row)*x„ij,,,j(col'+row+L/2) 
d = x,„yjd(col'+row) 
Xmyid (col'+row) = d + c 
Xmj/id (col'+row+L/2) = d - c 
end do 
end do 
end do 

DISTRIBUTE_BLOCK_TO_CYCLIC (x^yid , m , psize) 
do q = breakpoint, t 

L = 2**q 

do row = 0,L/2-l 

weightmyidCrow) = EXP( (2*pi*i*row)/L) 

end do 

do col' = 0,n-l,L 

do row = myid,L/2-l ,in 

c = weight„yi(i(row)*Xmj,i(;(col'+row+L/2) 
d = x„yid(col'+row) 
Xmi/id (col'+row) = d + c 
Xmi/id (col'+row+L/2) = d - c 
end do 
end do 
end do 

D I STRIBUTE_CYCL I C_TQ .CENTRALIZED (x^yid , m , psize , n) 

Fig. 5.6. Per-Processor Distributed Code Template Based on Partial Template of Fiaure \5. 1\ 



5.2. Distributed Template with Partitioned Data and Contiguous Ac- 
cess. 

5.2.1. Development of Template from Plan Using Partitioned Data. 
Figure 15.71 shows a per-processor template obtained from Figure I4.9[ s parallel eom- 
bined computation plan using partitioned data. The template in Figure 15.71 assumes 
that all n data elements are initially located on processor and stored in array xq. It 
also assumes that at the end of the algorithm, all n data elements are to be collected 
together on processor and stored in array xq. Each processor p is assumed to have 
its own local arrays weighty, xblockp, weightcycliCj,, and xcycliCp. The itera- 
tion over the p loops in Figure 14.91 is replaced by the execution of the per-processor 
template in Figure [5771 on each of the m processors. 

A high-level description of message-passing code to perform the four data redistri- 
bution commands in Figure [5?7l (centralized-to-block-partitioned. block-partitioned- 
to-centralized, centralized-to-cyclic-partitioned, and cyclic-partitioned-to-centralized) 
is shown in Figure 
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DISTRIBUTE_CENTRALIZED_TO_BLOCK_PARTITIONED(xo,xblock,„yid,m,psize) 
do q = 1, breakpoint - 1 

L = 2**q 

do row = 0,L/2-l 

weightmyidCrow) = EXP((2*pi*i*row)/L) 

end do 

do col" = 0,psize-l,L 
do row = 0,L/2-l 

c = weight,„j^id(row)*xblock„jyi(i(col"+row+L/2) 
d = xblock„jj,id(col"+row) 
xblockmyid (col"+row) = d + c 
xblock„iyid (col"+row+L/2) = d - c 
end do 
end do 
end do 

DISTRIBUTE_BLOCK_PARTITIDNED_TO_CENTRALIZED(xblock,„yid,xo,m,psize) 
DISTRIBUTE_CENTRALIZED_TD_CYCLIC_PARTITIONED(xo,xcyclic„jj,,;<j,m,psize) 
do q = breakpoint, t 
L = 2**q 

do row' = 0,L/(2*m)-l,l 

weiglitcycliCmyid(row') = EXP( (2*pi*i* (m*row'+myid) ) /L) 
end do 

do col" = 0,psize-l,L/m 
do row' = 0,L/(2*m)-l,l 

c = weightcyclic„jj,i^(row')*xcyclic„jj,i(i(col"+row'+L/(2*m)) 
d = xcyclic,„yid(col"+row') 
xcyclic„ij^,;£;(col"+row') = d + c 
xcyclic„ij,i£;(col"+row'+L/(2*m) ) = d - c 
end do 
end do 
end do 

DISTRIBUTE_CYCLIC_PARTITIDNED_TQ_CENTRALIZED(xcyclic™yirf,xo,in,psize) 

Fig. 5.7. Per-Processor Distributed Code Template Obtained from Combined Plan Using Par- 
titioned Data of Figure \4-9\ 

5.2.2. Eliminating Bottleneck in Block to Cyclic Redistribution. In Fig- 
ure [5?7l the pair of data redistribution commands 

DISTRIBUTE_BLOCK_PARTITIDNED_TO_CENTRALIZED(xblock,„j,id,Xo,m,psize) 
DISTRIBUTE_CENTRALIZED_TD_CYCLIC_PARTITIONED(xo,xcyclic„jj,,;<j,m,psize) 

occurring between the two q loops, together accomplish a block-partitioned-to-cyclic- 
partitioned data redistribution. This two-command sequence can be replaced by the 
single command 

DlSTRlBUTE_BLOCK_PARTlTIDNED_TO_CYCLIC_PARTITIONED(xblockmyid,xcyclic„yjd,ni,psize) 

This combined command can be implemented by doing a block-partitioned-to-centralized 
redistribution followed by a centralized-to-block-partitioncd redistribution, as indi- 
cated in Figure [?771 in a manner analogous to that shown in Figure !?^ Alternately, 
the block-partitioned-to-centralized redistribution can be done directly, as shown in 
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! DISTRIBUTE_CENTRALIZED_TO_BLOCK_PARTITIONED(xo,xblock,„yid,m,psize) 
if myid = then 

xblockmyid = xo(0:psize-l) 

do otherp = l,m-l 

SEND (xq (psize*otherp) , psize , otherp) 

end do 
else 

RECEIVE (xhlockmyid , ps ize , 0) 
endif 

! DISTRIBUTE_BLOCK_PARTITIONED_TO_CENTRALIZED(xblock™yjd,xo,m, psize) 
if myid = then 

xq (0 :psize-l) = xhloc'krnyid 

do otherp = l,m-l 

RECEIVE (xo(psize*otherp) , psize , otherp) 

end do 
else 

SEND (xblockniyjd, psize , 0) 
endif 

! DISTRIBUTE_CENTRALIZED_TQ_CYCLIC_PARTITIONED(xo,xcyclic„yid,m, psize) 
if myid = then 

xcyclic„iyid = xo(0:n-l:m) 

do otherp = l,m-l 

SEND (xq (otherp :n-l :m) , psize , otherp) 

end do 
else 

RECEIVE (xcyclic„ij,i(i> psize ,0) 
endif 

! DISTRIBUTE_CYCLIC_PARTITIDNED_TD_CENTRALIZED(xcyclic„iyid,xo,m, psize) 
if myid = then 

xo(0:n-l:m) = xcyclic„yid 

do otherp = l,m-l 

RECEIVE(xo(otherp:n-l :m) , psize , otherp) 

end do 
else 

SEND (xcycllCmyid > psize , 0) 
endif 

Fig. 5.8. Code for Distribution Commands Occurring in Figure [5. T\ 's Per-Processor Template 
Using Partitioned Data 



Figure [5T9l which is analogous to Figure [5T3l The direct implementation removes the 
centralized site as a bottleneck, but increases the number of messages from 2(m — 1) 
to m{m — 1), as discussed in Section [5.1.31 



6. Development of All— Processor Templates for Shared Memory Ar- 
chitectures vifith Local Memory Availability. 
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! DISTRIBUTE_BLOCK_PARTITIQNED_TD_CYCLIC_PARTITIONED(xblock™yid, 

xcyclicmyid,m,psize) 
xcycliCmyid(niyid*psize/m: (myid+1) *psize/ni-l) = 
xblock„iy,;ci(niyid:psize-l :m) 
do otherp = 0,m-l 

if otherp ^ myid then 

SEND (xblockmjjid (otherp :psize-l :m) ,psize/m, otherp) 
RECEIVE (xcyclic„iy,;£;(otherp*psize/m) ,psize/m, otherp) 
endif 
end do 

Fig. 5.9. Block-Partitioned to Cyclic-Partitioned Redistribution 

6.1. Shared Memory Template. The parallel computation plan of Figure 
can be converted into a shared memory template by converting each of the two p loops 
into a parallel loop, as shown in Figure [Ol The template in Figure [6T] is expressed in 
a style roughly based on OpenMP. We assume that there is a thread for each iteration 
of a parallel do loop, and that each thread can have data that is private to that thread. 
Moreover, we assume that each thread is assigned to a processor (with the possibility 
that multiple threads are assigned to any given processor), and data that is private 
to a given thread is stored in the local memory of the processor for that thread. In 
the template, we use the notational convention that x is a shared data array; and 
that scalar variables, such as q, are private to each thread. We also assume that each 
thread p has a private array weighty for the weights used in the execution of that 
thread. Within each thread, the assignments to scalar variables and to elements of 
weightp modify private data. The assignment to elements of shared array x modify 
shared data, but because we are starting with the computation plan of Figure 14. 3[ 
the elements of x accessed by the various threads of a given parallel do loop are 
disjoint. 

6.2. Shared Memory Template Using Private Local Memory. In Fig- 
ure 16. H all the accesses to x are to a shared data vector. The template can be 
modified so that the butterfly operates on private data, as shown in the all-processor 
template of Figure 16.21 which uses both a shared memory array x, and a private 
memory array Xp. Figure 1^751 shows the details of how the data copying is done. The 
OpenMP-style all-processor template uses assignment statements within each thread 
to copy data between shared memory and the private memory of that thread. 

Figure [6?2l is an all-processor analogue of the per-processor distributed code tem- 
plate shown in Figure ISTBl The data copying via assignment statements in Figure [6?3l 
is analogous to the data copying via message-passing in Figures 15. 2[ 15. 4[ and 15.51 
An OpenMP-style all-processor template permits a given thread to copy data be- 
tween shared memory and private memory of that thread, but does not permit direct 
copying of data from the private memory of one thread to the private memory of 
a different thread. Accordingly, the DISTRIBUTE_BLOCK_TO_CYCLIC command from 
Figure 15.61 must be done as two commands (private block to centralized at the end 
of the first parallel do, followed by centralized to private block at the beginning 
of the second parallel do) in Figure 16. 2[ corresponding to the data movement in 
Figure 15. 4i rather than that in Figure 15.31 

6.3. Shared Memory Template Using Partitioned Data and Contigu- 
ous Access. Figure IHTH shows an OpenMP-style all-processor template for a shared 
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parallel do p = 0,ni-l 
do q = 1, breakpoint - 1 
L = 2**q 
do row = 0,L/2-l 

weightp(row) = EXP((2*pi*i*row)/L) 
end do 

do col' = p*psize , (p+1) *psize-l ,L 
do row = 0,L/2-l 

c = weightp(row)*x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 
end parallel do 
parallel do p = 0,m-l 
do q = breakpoint, t 
L = 2**q 
do row = 0,L/2-l 

weightp(row) = EXP((2*pi*i*row)/L) 
end do 

do col' = 0,n-l,L 
do row = p,L/2-l,m 

c = weightp(row)*x(col'+row+L/2) 
d = x(col'+row) 
x(col'+row) = d + c 
x(col'+row+L/2) = d - c 
end do 
end do 
end do 
end parallel do 

Fig. 6.1. Simple Shared Memory All-Processor Template Obtained from Plan of Figure\4-3\ 



memory, based on the plan of Figure 14.91 We assume that there is a thread for each 
processor, and data that is private to a given thread is stored in the local memory of 
the processor for that thread. In the template, we use the notational convention that x 
is a shared data array; that weighty, weightcyclicp, xblockp, and xcyclicp denote 
private data arrays for thread p; and that scalar variables are private to each thread. 
We assume that a given thread can access only shared data and its private data. Con- 
sequently, the block-to-cyclic redistribution cannot be done by moving data directly 
from a private array in one thread into a private array in a different thread. In Fig- 
ure [6]4l the block-to-cyclic redistribution is done in two steps. In the first step, each 
thread of the first p loop does a block-to-centralized copying of data from its private 
array xblockp into shared array x. Consequently, when the first p loop concludes, the 
data from all the private xblockp arrays have been copied into shared array x. In the 
second step, each thread of the second p loop does a centralized-to-block copying of 
data from x into its private array xcyclicp. 
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parallel do p = 0,m-l 

COPY_CENTRALIZED_TO_PRIVATE_BLOCK (x , Xp , p , psize) 
do q = 1, breakpoint - 1 
L = 2**q 
do row = 0,L/2-l 

weightp(row) = EXP( (2*pi*i*row)/L) 
end do 

do col' = p*psize, (p+l)*psize-l,L 
do row = 0,L/2-l 

c = weightp(row) *Xp(col'+row+L/2) 
d = Xp(col'+row) 
Xp(col'+row) = d + c 
Xp (col'+row+L/2) = d - c 
end do 
end do 
end do 

COPY_PRI VATE_BLOCK_TD_CENTRALIZED (xp , x , p , ps ize ) 
end parallel do 
parallel do p = 0,m-l 

COPY_CENTRALIZED_TO_PRIVATE_CYCLIC (x , Xp , p , m , psize ) 
do q = breakpoint, t 
L = 2**q 
do row = 0,L/2-l 

weightp(row) = EXP( (2*pi*i*row)/L) 
end do 

do col' = 0,n-l,L 
do row = p,L/2-l,m 

c = weightp(row) *Xp(col'+row+L/2) 
d = Xp(col'+row) 
Xp(col'+row) = d + c 
Xp(col'+row+L/2) = d - c 
end do 
end do 
end do 

COPY_PRI VATE_CYCLIC_TD_CENTRALIZED (xp , x , p , m , ps ize ) 
end parallel do 

Fig. 6.2. All-Processor Shared and Private Memory Template Based on Template of Fiaure \6.1\ 



Figure [131 shows the details of the assignment statements that do the data copying 
between shared memory and the private memory of each thread. 

7. Preliminary Experimental Results. There are various commercial and 
vendor specific libraries which include the FFT. The Numerical Analysis Group (NAG) 
and Visual Numerics (IMSL) provide and support finely tuned scientific libraries spe- 
cific to various HFC platforms, e.g SGI and IBM. The SGI and IBM libraries, SCSL 
and ESSL, are even more highly tuned, due to their knowledge of proprietary infor- 
mation specific to their respective architectures. We ran experiments comparing our 
code to that in some of these libraries. We also did comparisons with the FFTW, 
although these results are harder to interpret because of the planning phase that is 
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! COPY_CENTRALIZED_TO_PRIVATE_BLOCK (x , Xp , p , psize ) 

Xp (psize*p :psize* (p+l)-l : 1) = x(psize*p :psize* (p+l)-l : 1) 

! COPY_PRIVATE_BLOCK_TO_CENTRALIZED (xp , x , p , psize) 
x(psize*p:psize*(p+l)-l : 1) = Xp (psize*p :psize* (p+l)-l : 1) 

! COPY_CENTRALIZED_TO_PRIVATE_CYCLIC (x , Xp , p , m , ps ize) 
Xp(p:ii-1 :m) = x(p:n-l:m) 

! COPY_PRIVATE_CYCLIC_TD_CENTRALIZED (xp , x , p , m , psize) 
x(p:n-l:m) = Xp (p :ii-l :m) 

Fig. 6.3. Data Copying for All-Processor Shared and Private Memory Template of Fiaure \6.S\ 

part of the FFTW. 

Our experiments are reported in [141 1151 151j . Here, we give results from [ISj . 
where more details are available. 

7.1. Experimental Environment. Our experiments were run on two systems: 

1. A SGI/Cray Origin2000 at NCSAI in lUinois, with 48, 195Mhz RIOOOO pro- 
cessors, and 14GB of memory. The LI cache size is 64KB (32KB Icache and 
32 KB Dcache). The Origin2000 has a 4MB L2 cache. The OS is IRIX 6.5. 

2. An IBM SP-2 at the MAUI High Performance Computing CenteiEl, with 32 
P2SC 160Mhz processors, and 1 GB of memory. The LI cache size is 160KB 
(32KB Icache and 128KB Dcache), and there is no L2 cache. The OS is AIX 
4.3. 

We tested the math libraries available on both machines: on the Origin 2000, 
IMSL Fortran Numerical Libraries version 3.01, NAG version Mark 19, and SGI's 
SCSL library; and on the SP-2, IBM's ESSL library We also ran the FFTW on both 
machines. 

7.2. Experiments. Experiments on the Origin 2000 were run using bsub, SGI's 
batch processing environment. Similarly, experiments on the SP-2 were run using the 
loadleveler batch processing environment. In both cases we used dedicated networks 
and processors. For each vector size (2"^ to 2^**), experiments were repeated a minimum 
of three times and averaged. For improved optimizations, vendor compilers were used 
with the -03 and -Inline flags. We used Perl scripts to automatically compile, run, 
and time all experiments, and to plot our results for various problem sizes. 

7.3. Evaluation of Results. Our results on the Origin 2000 and the SP-2, 
as shown in Figures 17.11 and 17.21 and Tables 17.11 and 17.21 indicate a performance im- 
provement over standard libraries for many problem sizes, depending on the particular 
library. 

7.3.1. Origin 2000 Results. Performance results for our monolithic FFT, 
which we call here FFT-UA, indicate a doubling of time when the vector size is 
doubled, for all vector sizes tried. IMSL doubled its performance up to 2^^. At 2^^ 



®This work was partially supported by National Computational Science Alliance, and utilized 
the NCSA SGI/CRAY Origin2000 

®We would like to thank the Maui High Performance Computing Center for access to their IBM 
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parallel do p = 0,m-l 

COPY_CENTRALIZED_TO_PRIVATE_BLOCK_PARTITIONED (x , xblockp , m , psize , p) 
do q = 1, breakpoint - 1 
L = 2**q 
do row = 0,L/2-l 

weightp(row) = EXP( (2*pi*i*row)/L) 
end do 

do col" = 0,psize-l,L 
do row = 0,L/2-l 

c = weightp(row)*xblockp(col"+row+L/2) 
d = xblockp (col"+row) 
xblockp(col"+row) = d + c 
xblockp (col"+row+L/2) = d - c 
end do 
end do 
end do 

COPY_PRI VATE_BLOCK_PARTITIDNED_TO_CENTRALIZED (xblockp , x , p , m , psize , n) 
end parallel do 
parallel do p = 0,m-l 

COPY_CENTRALIZED_TO_PRIVATE_CYCLIC_PARTITIONED(x,xcycliCp,m, psize, p,n) 

do q = breakpoint, t 
L = 2**q 

do row' = 0,L/(2*m)-l,l 

weightcycliCp(row') = EXP((2*pi*i*(in*row'+p))/L) 
end do 

do col" = 0,psize-l ,L/m 
do row' = 0,L/(2*m)-l,l 

c = weightcycliCp(row') *xcycliCp(col"+row'+L/ (2*m) ) 
d = xcycliCp(col"+row') 
xcycliCp(col"+row') = d + c 
xcycliCp(col"+row'+L/(2*m) ) = d - c 
end do 
end do 
end do 

COPY_PRIVATE_CYCLIC_PARTITIDNED_TO_CENTRALIZED(xcycliCp,x,m, psize, p,n) 
end parallel do 

Fig. 6.4. All-Processor Shared Memory and Private Partitioned Data Template Obtained from 
Combined Plan of Figure \4-9\ 

there is a 400% degradation in performance, presumably due to a change in the use 
of the memory hierarchy. For NAG this degradation begins at 2^^. The SGI hbrary 
(SCSL) does not exhibit this degradation. SCSL may be doing machine specific op- 
timizations, perhaps using more sophisticated out of core techniques similar to those 
described by Gormen [53] . as evidenced by nearly identical performance times for 2^^ 
and 218. 

7.3.2. SP-2 Results. FFT-UA outperforms ESSL for vector sizes up to 2'^'^, 
except for two cases. For 2^^ and 2^^, the performance is slightly worse. ESSL does 
increasingly better as the problem size increases. The FFT-UA times continue to 
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! COPY_CENTRALIZED_TO_PRIVATE_BLOCK_PARTITIONED(x,xblockp,p,psize) 
xblockp = x(psize*p :psize* (p+1) -1 : 1) 

! COPY_PRIVATE_BLOCK_PARTITIDNED_TO_CENTRALIZED (xblockp, m,psize,p) 
x(psize*p:psize*(p+l)-l : 1) = xblockp 

! COPY_CENTRALIZED_TO_PRIVATE_CYCLIC_PARTITIONED(x,xcycliCp,m,psize,p,n) 
xcycliCp = x(p:n-l:m) 

! COPY_PRIVATE_CYCLIC_PARTITIONED_TQ_CENTRALIZED(xcycliCp,x,m,psize,p,ii) 
x(p:n-l :m) = xcycliCp 

Fig. 6.5. Data Copying for All-Processor Shared Memory and Private Partitioned Data Tem- 
plate of Figure \6.4\ 

Origin 2000 Performance Improvement 
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Fig. 7.1. Percent improvement o/ FFT-UA over library code and FFTW on Origin 2000 



double from 2^"^ to 2^^ , reflecting the uniformity and simplicity of the code. The 
FFT-UA code is machine-independent, and relies on the efficiency of the compiler 
used. Presumably, the ESSL code is tuned to the machine architecture, using machine 
specific optimizations, and should be expected to perform better. 

The ESSL code failed for problem sizes 2^^ and higher, whereas FFT-UA suc- 
cessfully processed problem sizes through 2^^, four times larger than the maximum 
handled by ESSL. The FFTW code failed for problem sizes 2^^^ and higher. 
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Origin 2000 

Execution Time in Seconds 



Size 


FFT-UA 


IMSL 


NAG 


SCSL 


2^ 


0.019 


0.064 


0.010 


0.065 


24 


0.018 


0.061 


0.010 


0.047 


2'"^ 


0.018 


0.062 


0.010 


0.065 


26 


0.017 


0.116 


0.011 


0.073 


2^ 


0.019 


0.063 


0.010 


0.068 


28 


0.018 


0.062 


0.011 


0.105 


2» 


0.017 


0.122 


0.011 


0.069 


210 


0.021 


0.065 


0.013 


0.056 


2" 


0.021 


0.064 


0.016 


0.058 


212 


0.021 


0.067 


0.023 


0.067 


213 


0.022 


0.075 


0.036 


0.065 


214 


0.024 


0.144 


0.065 


0.066 


215 


0.030 


0.120 


0.135 


0.110 


216 


0.040 


0.209 


0.296 


0.080 


21^ 


0.065 


0.335 


0.696 


0.072 


218 


0.126 


0.829 


3.205 


0.075 


219 


0.238 


3.007 


9.538 


0.096 


220 


0.442 


9.673 


18.40 


0.143 


221 


0.884 


23.36 


38.93 


0.260 


222 


1.910 


46.70 


92.75 


0.396 


223 


4.014 


109.4 


187.7 


0.671 


224 


7.550 


221.1 


442.7 


1.396 



Table 7.1 

Real Execution Times of FFT-UA and Comparative Libraries on Origin 2000 



7.4. Conclusions from Experiments. As evidence of the potential value of 
the uniform program design methodology outlined here, we constructed a machine- 
independent portable solution to the complex problem of faster and bigger FFTs. 
Reproducible experiments indicate that our designs outperform IMSL in all cases, 
NAG for sizes greater than 2^^, SCSL for sizes less than 2^*, and ESSL in some cases. 

Our single, portable, scalable executable of approximately 2,600 bytes also must 
be compared with the large suite of machine-specific software required by NAG, 
IMSL, SCSL, and ESSL. The user of these machine-specific libraries must have a 
deep understanding of the package itself and the system on which it runs. Many of 
these packages require numerous parameters: fifteen for ESSL, and eight for SCSL. 
An incorrect decision for these parameters can result in poor performance and even 
possibly incorrect results. In comparison, a naive user of FFT-UA need only know 
the vector size on any platform. 

8. Conclusions, Extensions, and Future Research. We have outlined a 
semi-mechanical methodology for developing efficient scientific software, centered on 
interactively developed sequences of algorithm transformations. We systematically 
improved the efficiency of the sequential expression of the high-level specification 
of the FFT algorithm, and formulated processor mapping and data decompositions 
strategies. As a phase of this methodology, abstract plans were constructed to specify 
which computations are to be done by each processor. Subsequently, templates were 
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Fig. 7.2. Percent improvement of FFT-UA over ESSL and FFTW on IBM SP-2 
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Execution Time in 


Seconds 




Size 


FFT-UA 


ESSL 


Size 


FFT-UA 


ESSL 


2^ 


0.010 


0.013 


2i4 


0.040 


0.043 


24 


0.010 


0.053 


215 


0.070 


0.060 


25 


0.010 


0.013 


216 


0.140 


0.120 


26 


0.010 


0.013 


217 


0.280 


0.160 


2^ 


0.010 


0.013 


218 


0.580 


0.310 


2« 


0.010 


0.016 


219 


1.216 


0.610 


2° 


0.010 


0.010 


220 


2.580 


1.276 


210 


0.010 


0.013 


221 


5.553 


3.663 


2" 


0.013 


0.010 


222 


12.12 


Failed 


212 


0.020 


0.020 


223 


25.25 


Failed 


213 


0.033 


0.030 









Table 7.2 

Real Execution Times of FFT-UA and ESSL on IBM SP-2 



created attuned to various high-level architectural paradigms, e.g. shared and dis- 
tributed memory multiprocessor computations. Parallel/distributed programs can be 
easily built from templates. Experimental comparisons indicate that our programs can 
be competitive in performance with software that has been hand-tuned to particular 
target architectures. 
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The algorithm variants developed here can be further optimized in several ways, 
consistent with our underlying methodology. Examples include tuning for shared 
memory, retiling to match the memory hierarchy, and better overlap of computation 
and communication. More generally, additional compiler-like analysis, transforma- 
tions, and optimizations can be used to improve performance. This can be done during 
the various phases of the methodology. Finally, our proposed methodology for soft- 
ware development can be enhanced by a script-based mechanism for experimentally 
exploring the efficiency of alternative implementations for different ranges of problem 
sizes and different computing environments. The results of such experiments should 
prove useful in evaluating and guiding successive and/or alternative refinements. 

The approach we have presented is similar in spirit to other efforts using libraries 
of algorithm building blocks based on C+-t- template classes such as POOMA. Based 
on the formalism of A Mathematics of Arrays (MoA) and an indexing calculus (i.e. 
the tp calculus) our approach enables the programmer to develop algorithms using 
high-level compiler-like optimizations through the ability to algebraically compose 
and reduce sequences of array operations. The resulting ONF is a prescription for 
code generation that can be directly translated into efficient code in the language of 
the programmer's choice be it for hardware or software application. 

Appendices: 

Appendix A. Elements of the theory. 

A.l. Indexing and Shapes. The central operation of MoA is the indexing 
function 

ptpA 

in which a vector of n integers p is used to select an item of the n-dimensional array 
A. The operation is generalized to select a partition of A, so that if q has only k <n 
components then 

qi^A 

is an array of dimensionality n — k and q selects among the possible choices for the 
first k axes. In MoA zero origin indexing is assumed. For example, if A is the 3 by 5 
by 4 array 

" 

4 
8 
12 
16 

then 
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< 2 1 > = < 44 45 46 47 > 



< 2 1 3 > 



= 47 



Most of the common array manipulation operations found in languages like For- 
tran90, Matlab, ZPL, etc., can be defined from ip and a few elementary vector oper- 
ations. 

We now introduce notation to permit us to define ip formally and to develop 
the Psi Correspondence Theorem |54|, which is central to the effective exploitation 
of MoA in array computations. We will use A, B, ... to denote an array of numbers 
(integer, real, or complex). An array's dimensionality will be denoted by (Ia and will 
be assumed to be n if not specified. 

The shape of an array A, denoted by sa, is a vector of integers of length dA, each 
item giving the length of the corresponding axis. The total number of items in an 
array, denoted by Ia, is equal to the product of the items of the shape. The subscripts 
will be omitted in contexts where the meaning is obvious. 

A full index is a vector of n integers that describes one position in an n-dimensional 
array. Each item of a full index for A is less than the corresponding item of sa- There 
are precisely Ia indices for an array (due to a zero index origin). A partial index of A 
is a vector of < fc < n integers with each item less than the corresponding item of 

SA- 

We will use a tuple notation (omitting commas) to describe vectors of a fixed 
length. For example, 

< i j k > 

denotes a vector of length three. <> will denote the empty vector which is also 
sometimes written as Q. 

For every n-dimensional array A, there is a vector of the items of A, which we 
denote by the corresponding lower case letter, here a. The length of the vector of 
items is i^- A vector is itself a one-dimensional array, whose shape is the one-item 
vector holding the length. Thus, for a, the vector of items of A, the shape of a is 

Sa = < > 

and the number of items or total number of components c F"l is 

ta = tA- 

The precise mapping of A to a is determined by a one-to-one ordering function, 
gamma. Although the choice of ordering is arbitrary, it is essential in the following 
that a specific one be assumed. By convention we assume the items of A are placed 
in a according to the lexicographic ordering of the indices of A. This is often referred 
to as row major ordering. Many programming languages lay out the items of mul- 
tidimensional arrays in memory in a contiguous segment using this ordering. Fortran 
uses the ordering corresponding to a transposed array in which the axes are reversed 
column major. Scalars are introduced as arrays with an empty shape vector. 

There are two equivalent ways of describing an array A: 
(1) by its shape and the vector of items, i.e. A — {sa, a}, or 



^"We also use ra, Sa, and pa to denote total number of components, dimensionality and shape of 

a. 
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(2) by its shape and a function that defines the value at every index p. 
These two forms have been shown to be formally equivalent [5S]. We wish to use the 
second form in defining functions on multidimensional arrays using their Cartesian 
coordinates (indices). The first form is used in describing address manipulations to 
achieve effective computation. 

To complete our notational conventions, we assume that p,q, ... will be used to 
denote indices or partial indices and that u,v,.. will be used to denote arbitrary 
vectors of integers. In order to describe the ith item of a vector a, either or a[i] 
will be used. If u is a vector of k integers all less than Ia, then a[u] will denote the 
vector of length k, whose items are the items of a at positions Uj, j = 0, fc — 1. 

Before presenting the formal definition of the ^ indexing function we define a few 
functions on vectors: 

u -'r\r V catentation of vectors u and v 

u + V itemwise vector addition assuming — ty 

u X V itemwise vector multiplication 

n + u, u + n addition of a scalar to each item of a vector 

n X u, u X n multiplication of each item of a vector by a scalar 

L n the vector of the first n integers starting from 

TT V a scalar which is the product of the components of v 

k A u when fc > the vector of the first k items of m, (called take) 

and when A: < the vector of the last k items of u 
k \/ u when fc > the vector oi tu — k last items of u, (called drop) 

and when fc < the vector of the first tu ~ \k\ items of u 
k u when fc > the vector of [k \/ u) ^{k A u) 

and when fc < the vector or (fc A u) -H-(fc V u) 
.1. Let A he an n- dimensional array and p a vector of integers. If 



Definition A. 

p is an index of A, 



whe 



pipA = a[7(sA,p)], 



7(syi,p) — Xn-1 defined by the recurrence 

^■j = ^j-i * +Pj, i = 1, ■■■,n- 1. 
If p is partial index of length k < n, 

pijjA = B 

where the shape of B is 

SB = k SJ SA, 

and for every index q of B, 

qi)B = [p -Vrq)i)A 

The definition uses the second form of specifying an array to define the result of a 
partial index. For the index case, the function 7(5, p) is used to convert an index p to 
an integer giving the location of the corresponding item in the row major order list 
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of items of an array of shape s. The recurrence computation for 7 is the one used in 
most compilers for converting an index to a memory address [56) . 
Corollary A. 2. <> ^ A = A. 

The foUowing theorem shows that a ip selection with a partial index can be ex- 
pressed as a composition of ip selections. 

Theorem A. 3. Let A be an n- dimensional array and p a partial index so that 
p = q -H-r. Then 

pip A — r'ip(qipA). 
Proof: The proof is a consequence of the fact that for vectors u,v,w 

[u -Vrv) -VrW = U -Vr{v -Vrw). 

If we extend p to a full index by p then 

p'^{pi})A) = [p -Vrp')ll)A 

= {{q -^r) -Vrp')ipA 

= (g4f(r4fp'))V'^ 
= (r -Vrp')il){q'iljA) 

= p'lp^ril'^q'ipA)) 

pip A = ripi^qijjA) 

which completes the proof. 

We can now use psi to define other operations on arrays. For example, consider 
definitions of take and drop for multidimensional arrays. 

Definition A. 4 (take; A). Let A be an n-dimensional array, and k a non- 
negative integer such that < fc < sq. Then 

k A A = B 

where 

SB =< /O 4f (1 V sa) 

and for every index p of B, 

pipB = pip A. 

(In MoA A is also defined for negative integers and is generalized to any vector 
u with its absolute value vector a partial index of A. The details are omitted here.) 
Definition A. 5 (reverse: $). Let A be an n-dimensional array. Then 

S<i>A = SA 

and for every integer i, < i < Sqj 

< i > ip^A So - i - 1 > ijA. 

This definition of $ does a reversal of the 0th axis of A. 

Note also that all operations are over the 0th axis. The operator fl [55] extends 
operations over all other dimensions. 
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Example. Consider the evaluation of the following expression using the 3 by 5 
by 4 array, A, introduced in Section lA.il 



< 1 2 > V(2 A $A) 



(A.l) 



where A is the array given in the previous section. The shape of the result is 



This process of simplifying the expression for the item in terms of its Cartesian coor- 
dinates is called Psi Reduction. The operations of MoA have been designed so that 
all expressions can be reduced to a minimal normal form [26] . Some MoA operations 
defined by psi are found in Fig lA.ll 

Appendix B. Higher Order Operations. 

Thus far operation on arrays, such as concatenation, rotation, etc., have been 
performed over their 0th dimensions. We introduce the higher order binary operation 
17, which is defined when its left argument is a unary or binary operation and its 
right argument is a vector describing the dimension upon which operations are to be 
performed, or which sub-arrays are used in operations. The dimension upon which 
operations are to be performed is often called the axis of operation. The result of Q 
is a unary or binary operation. 

B.l. Definition of il. ft is defined whenever its left argument is a unary or 
binary operation, / or g respectively (/ and g include the outcome of higher order 
operation), ff's right argument is a vector, d, such that pd =< 1 > or pd =< 2 > 
depending on whether the operation is unary or binary. Commonly, / (or g) will 
be an operation which determines the shape of its result based on the shapes of its 
arguments, not on the values of their entries, i.e. for all appropriate arguments p/^ 
is determined by and p^ig^r is determined by p^i and p£,r- 

Definition 0: ^f2j-is defined when / is a one argument function, d =< a >, with 



2 V 5(2 A $A) 

= 2v(<2>4f(lv s<s,a)) 
= 2 V (<2>^(1 V sa)) 
= 2v(<2> + <54>) 
=2v <254> 
= < 4 > . 



(A.2) 



The expression can be simplified using the definitions: 



< 1 2 > i/;(2 A $A) 

< 1 2 > Tp'^A 

< 2 > V(< 1 > -0*^) 

< 2 > V(< 3 - 1 - 1 > V^) 

< 1 2 > 



(A.3) 



cr > 0. 



For any non-empty array ^, 



(B.l) 
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Symbol 


Name 


Description 





Dimensionality 


Returns the number of dimensions of an array. 


p 


Shape 


Returns a vector of the upper bounds 
or sizes of each dimension in an array. 


i-C 


Iota 


W^hen n — O(scalar), returns a vector containing elements 
0, to — 1. When n — l(vector), returns an 
array of indices defined by the shape vector 




Psi 


The main indexing function of the Psi Calculus 
which defines all operations in MoA. Returns a scalar 
if a full index is provided, a sub-array otherwise. 






\"Cclorizcs a iiiuh i-diiuousioual ai'i'a\" l)asc(l 
on a.ii a.rra^"'^ layoyili", row ■, led • "i .-^par.-^c , ■■■) 


7 


Gamma 


Translates indices into offsets given a shape. 


7 


vjctiiiiiidi xii vd 


X 1 CLllZMCLliOD iJllOdiO llll^vJ IIIUIL^CO veil tX 011Clf_J^C 


s p 




v^iiciiigco viic c»iici_jjc; Vd./Uvji \ji ail txL i cty ^ '-"■j diio^ uiiig 

its dimensionality. Reshape depends on layout(7). 




Pi 


i^(■tu^n^^ a scalai. and l^ o(]un"Mlont to [[ :r[/\ 


T 


T;uL 


HtiLui'ii^ ihc iiiiiiibtir ol coiiipoiiciils in an arra.^>',(ri; = Ti:{pt,)) 




Catenate 


Coiicateiiatcs two arrays over their primary axis. 




Point-wise 
Extension 


A data parallel application of / is performed 
between all elements of the arrays. 




Scalar Extension 


cr is used with every component of in the data parallel 

application of /. 


A 


Take 


Returns a sub-array from the beginning or end of an array 
based on its argument being positive or negative. 


V 


Drop 


The inverse of Take 


op rod 


Reduce 


Reduce an array's dimension b>' one b>' applying 
op o\'or 1 ho priniarv axi^ of au arra\'. 


<I> 


ii,c\'cr^c 


Hi^nxTscs ihc coiiipoiieiits oi an arra^". 


e 


Rotate 


Rotates, or shifts cyclically, components of an array. 




Transpose 


Transposes the elements of an array based on 

a given permutation vector 


n 


Omega 


Applies a unary or binary function to array argument(s) 

given partitioning information. O is used to perform all operations 

(defined over the primary axis only) over all dimensions. 



Fig. A.l. Summary of MoA Operations 



is defined provided {i) (6^) > a, and provided certain other conditions, stated below, 
are met. Let 

u = i-a)vP^- (B.2) 

We can write 

p^ = u-{^z (B.3) 

where z = {—cr) A p^. 

We further require (n) there exists w such that for <* i <* u, 

/(iVO (B.4) 

is defined and has shape w. The notation <* i <* u, is a shorthand which imphes 
that we are comparing two vectors i and u component by component. With this 

p{fng)^ = u^w (B.5) 

and for <* i <* u, 

MfQ^)^ ^ fii^l^O (B.6) 

End of Definition 
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Note that condition (ii) is easily satisfied for common /'s. 

Definition 1: We similarly define when its function argument is a binary opera- 
tion g. gfijis defined when 17 is a two argument function, d =< ai (Jr >, with a; > 0, 
and CTr > 0. 

For any non-empty arrays, and ^r, 

^l{g^d)^r (B.7) 

is defined provided (i) ((5^/) > ai and {S£_r) > <^r, and provided certain other condi- 
tions, stated below, arc met. 

We let L denote the binary operation minimum and let 

m=mi)-ai)[mr)-(Tr). (B.8) 

We require that {ii) ((— m) A (— cr;) v P^l) = ((~"^) ^ {~^r) V PCr)- 
Let 

x = {{-m) A (-aO V P6) = ((-m) A (-a^ v P^r), (B.9) 

u= (-to) V {-(^i) V P^i, (B.IO) 

w = (-m) V (-o-r) V P^r- (B.ll) 
Note that u=<>oru=<> (both could be empty). We can now write 

p^i=u^x^y, (B.12) 

and, 

p^r = v-H-x^z (B.13) 

where y = {—cri) A p^i and z = (— Cr) A p^^- Any of the vectors above could be 
empty. 

We also require {Hi) there exists a fixed vector w such that for <* i <* u, 
0<*j<*v,0<*k <* X, 

{0^k)^Mij^k)i;^r) (B.14) 

is defined and has shape w. 
With aU this 

P{^l{g^d)^r) = u-H-v-H-x-H-w (B.15) 

and for <* z <* w, <* J <* v, <* k <* x, 

{t^3^k)^ii{a^i)^r = {{^^k)ij^i)9{{j^k)ij^r) (B.16) 
End of Definition 1 
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